Abstract

The number of non-shared edges of two phylogenies is a basic measure of
the dissimilarity between the phylogenies. The non-shared edges are also the building
block for approximating a more sophisticated metric called the nearest neighbor
interchange (NNI) distance. In this paper, we give the first subquadratic-time algorithm
for finding the non-shared edges, which are then used to speed up the existing
approximation algorithm for the NNI distance from O(n^2) time to O(n log n) time.
Another popular distance metric for phylogenies is the subtree transfer (STT) distance.
Previous work on computing the STT distance considered degree-3 trees only.
We give an approximation algorithm for the STT distance for degree-d trees with
arbitrary d and with generalized STT operations.


1 Introduction 

Phylogenies are trees whose leaves are labeled with distinct species. Different theories about
the evolutionary relationship of the same species often result in different phylogenies. This
paper is concerned with three well-known metrics for measuring the dissimilarity between
two phylogenies, namely, the non-shared edge distance [1,11,14], the nearest neighbor
interchange (NNI) distance [12,13], and the subtree transfer (STT) distance [7,8]. The first
metric counts the number of edges that differentiate the phylogenies; the other two metrics
Figure 1: Examples of restricted-STT and STT operations. Tree (a) is transformed to
tree (b) by a restricted-STT operation, while tree (a) is transformed to tree (c) by an STT
operation.


measure the minimum total cost of the tree operations required to transform
one phylogeny into the other. For the NNI distance, an operation swaps two subtrees over
an internal edge; for the STT distance, an operation detaches a subtree from a node and
re-attaches it to another part of the tree.

In this paper we consider degree-d phylogenies whose edges may carry weights.^1
Given two weighted degree-d phylogenies T and T', an edge e in T is shared if for some edge
e' in T', the removals of e and e' from T and T', respectively, induce the same partition of
leaf labels, internal node degrees, and edge weights; otherwise, e is non-shared. Previously,
non-shared edges could be found using a brute-force approach in O(n^2) time, where n is
the number of leaves. If we restrict our attention to the partition of leaf labels only, Day
[4] reduced the time to O(n). We give an O(n log n)-time algorithm for finding the general
non-shared edges.

Finding non-shared edges is a key step, as well as the most time-consuming step, for
approximating the NNI distance. In particular, for degree-3 phylogenies with weights
or degree-d phylogenies with or without weights, existing approximation algorithms take
O(n^2) time [3,10]. With our new non-shared edge algorithm, the time complexity of these
approximation algorithms can all be improved to O(n log n). Note that for unweighted
degree-3 trees, an O(n log n)-time algorithm has already been obtained [11], which uses
Day's linear-time algorithm [4] to identify the non-shared edges.

Previous work on the STT distance focuses on degree-3 trees only [2,9]. In particular, in
the course of transforming a degree-3 tree to another degree-3 tree, all intermediate trees
are required to be of degree 3. In other words, the STT operation is restricted in the sense
that the detached subtree can only be re-attached to the middle of an edge, producing a
new internal node with degree 3. See Figure 1 for an example. In this paper we study the
STT distance for degree-d phylogenies for any d ≥ 3 while also allowing an STT operation
to re-attach the subtree to either an internal node or the middle of an edge.

An STT operation is charged by how far a subtree is transferred. More specifically,
depending on whether the trees are unweighted or weighted, we count respectively the

^1 The magnitude of an edge weight, also known as branch length in genetics, may represent the number
of mutations or the time required by the evolution along the edge.
Figure 2: The unweighted STT distance between T and T' is Ω(n). Consider T and T' as
weighted trees such that every internal edge has a unit weight (i.e., the highlighted edges
in the figure). The weighted STT distance between T and T' is 1. In particular, the cost
of transforming T to R, then to R', and finally to T' is 0 + 1 + 0 = 1.


number of edges or the total weight of the edges^2 between the nodes where detachment
and re-attachment take place. We formally define the STT (respectively, restricted-STT)
distance between two phylogenies as the minimum cost of transforming one to the other
using STT (respectively, restricted-STT) operations. Unlike many other graph or tree
problems, the unweighted version of the STT distance problem is not a special case of the
weighted version. In particular, Figure 2 shows two phylogenies whose unweighted STT
distance is Ω(n), yet if we assign a unit weight to every edge of these phylogenies, their
weighted STT distance is only O(1). On the other hand, the unweighted STT distance
is not necessarily bigger than the weighted one; Figure 3 shows two phylogenies whose
unweighted STT distance is smaller than the weighted one.



Figure 3: T and T' are degree-3 phylogenies. The unweighted STT distance between T 
and T' (which is 3) is smaller than the weighted STT distance (which is 4). 


Consider degree-3 phylogenies. In the weighted case, we can prove that the STT distance
is the same as the restricted-STT distance, and DasGupta et al. have shown that the latter
can be approximated within a factor of 2 [2]. In the unweighted case, deriving a tight
approximation algorithm is more difficult; the restricted-STT distance can be approximated
within only a factor of O(log n) [2]. This result implies an approximation algorithm for the
STT distance with the same performance. However, there are examples in which the STT
distance is much smaller than the restricted-STT distance. It is natural to ask whether the
STT distance can be approximated within a better factor.

^2 This cost model is referred to as the linear cost model in the literature [2,3]. It is preferable to the
unit cost model as the latter does not reflect the evolutionary distance.






Table 1: The approximation factors for the different variants of the STT distance. 



                 restricted-STT        STT
                 (degree-3 trees)      (degree-d trees)

weighted         2 [2]                 2
unweighted       O(log n) [2]          2d - 4


Consider degree-d phylogenies. First of all, it is worth mentioning that the restricted-STT
distance is ∞, as a restricted-STT operation can only produce an internal node of
degree 3. In the weighted case, the STT distance can be approximated by adapting the
algorithm by DasGupta et al. for degree-3 trees [2], achieving an approximation factor of
2. In the unweighted case, we give an algorithm to approximate the STT distance within a
factor of 2d - 4. This result implies that for unweighted degree-3 trees, the approximation
factor can be improved from O(log n) to 2 if the intermediate trees are not required to be
degree-3 trees. Table 1 summarizes the approximation factors for the variants of the STT
distance.

The following summarizes the contributions of the paper. 

1. We give the first subquadratic (i.e., O(n log n)) time algorithm for finding the
non-shared edges between two weighted phylogenies, thus improving the time complexity
of the algorithms in [3,10] for approximating the NNI distance from O(n^2) to
O(n log n).

2. We show that the problem of finding the STT distance between two weighted degree-d
phylogenies is equivalent to the problem of finding the restricted-STT distance
between two weighted degree-3 phylogenies. This result implies the following.

(a) The STT distance between two weighted degree-d phylogenies can be approximated
within a factor of 2.

(b) If the leaf labels of the trees are not distinct, it is NP-hard to compute the STT
distance between two trees.

3. We give an approximation algorithm with an approximation ratio of 2d - 4 for finding
the STT distance between two unweighted degree-d phylogenies.

4. We prove that it is NP-hard to compute the STT distance between two unweighted 
trees with leaves labeled by possibly non-distinct labels. 

The rest of this paper is organized as follows. Section 2 gives the O(n log n)-time
algorithm for finding the non-shared edges between two weighted trees. Section 3 presents
the results on computing the subtree transfer distance for both weighted and unweighted cases.
Finally, Section 4 shows that computing the STT distance between two unweighted phylogenies
with possibly non-distinct labels is NP-hard.
2 Finding non-shared edges between weighted trees


In this section, we show how to find all non-shared edges between two weighted phylogenies
in O(n log n) time. Basically, we transform the problem to a partition labeling problem which
can be solved in O(n log n) time. In Section 2.1, we define the partition labeling problem
on two rooted trees and solve the problem in O(n log n) time. In Section 2.2, we define the
non-shared edge problem and give an O(n log n)-time reduction from the non-shared edge
problem to the partition labeling problem.

The following multi-set notations are used in this section. Let M = {a, a, ..., b, b, ...}
be a multi-set of symbols. Let δ(M) be the set of distinct symbols in M. For each a ∈ δ(M),
let #(a in M) be the number of occurrences of a in M. Let |M| = Σ_{a ∈ δ(M)} #(a in M).
Furthermore, the set operations for any two multi-sets M and N are defined as follows:
(i) M ⊆ N if for each a ∈ δ(M), #(a in M) ≤ #(a in N). (ii) M = N if M ⊆ N and N ⊆ M.
(iii) M ∪ N is defined to be a multi-set containing max{#(a in M), #(a in N)} copies of a
for each a in δ(M) ∪ δ(N).
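As an illustration only, the multi-set notation above can be mirrored with Python's collections.Counter. The helper names distinct, size, is_subset and union are hypothetical and are not part of the paper; the sketch assumes every stored count is positive.

```python
from collections import Counter

def distinct(m):
    """delta(M): the set of distinct symbols in the multi-set m."""
    return set(m)

def size(m):
    """|M|: the total number of occurrences over all distinct symbols."""
    return sum(m.values())

def is_subset(m, n):
    """M is contained in N if no symbol occurs more often in m than in n."""
    return all(m[a] <= n[a] for a in m)

def union(m, n):
    """M union N: each symbol is kept with the larger of its two counts."""
    return Counter({a: max(m[a], n[a]) for a in set(m) | set(n)})

M = Counter("aab")    # the multi-set {a, a, b}
N = Counter("abbc")   # the multi-set {a, b, b, c}
U = union(M, N)       # the multi-set {a, a, b, b, c}
```

Two multi-sets are equal exactly when is_subset holds in both directions, which matches definition (ii).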

2.1 The partition labeling problem for rooted trees 

In this section we define the partition labeling problem for two rooted trees with leaves
labeled by the same multi-set of labels and solve the problem in O(n log n) time, where n
is the number of leaves in either tree. In the following, let R and R' be two rooted trees
with leaves labeled by the same multi-set S of labels, and let A be any subset of δ(S).

For each internal node u in R, we define the following.

• Let L_R(u) be the multi-set of leaf labels in the subtree of R rooted at u.

• Let L_R(u)|A be the multi-set of leaf labels constructed from L_R(u) by deleting all
labels which are not in A.

Given R and R', let V be the union of the sets of internal nodes in R and R'. A mapping
p : V → [1, t] (where t = |V|) is called a partition labeling for R and R' if for any u in R
and v in R', p(u) = p(v) if and only if L_R(u) = L_R'(v).

The partition labeling problem is to find a partition labeling for R and R'. Note that
such a partition labeling always exists. A straightforward approach is to compute all
multi-sets L_R(u) (and L_R'(v)) first and then assign a unique integer to each distinct multi-set.
However, each multi-set can be as large as O(n), so this straightforward approach takes
O(n^2) time.
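The straightforward approach can be sketched as follows; this is the quadratic baseline, not the O(n log n) algorithm developed below. The tree encoding (a leaf is a string, an internal node is a tuple of children) and the function names are assumptions made for this illustration only.

```python
from collections import Counter

def leaf_multisets(tree):
    """Return one Counter per internal node of the rooted tree, in postorder."""
    out = []

    def visit(node):
        if isinstance(node, str):          # a leaf carries a single label
            return Counter([node])
        total = Counter()
        for child in node:                 # merge the multi-sets of the children
            total += visit(child)
        out.append(total)
        return total

    visit(tree)
    return out

def naive_partition_labeling(r, r_prime):
    """Give equal integers to internal nodes with equal leaf multi-sets."""
    ids = {}

    def label(multiset):
        key = frozenset(multiset.items())  # hashable form of the multi-set
        return ids.setdefault(key, len(ids) + 1)

    labels_r = [label(m) for m in leaf_multisets(r)]
    labels_r_prime = [label(m) for m in leaf_multisets(r_prime)]
    return labels_r, labels_r_prime

R = (("a", "b"), ("c", "a"))
R_prime = (("a", "c"), ("b", "a"))
labels_r, labels_r_prime = naive_partition_labeling(R, R_prime)
```

Each multi-set is materialized in full, which is exactly the cost that the incremental scheme below avoids.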

To reduce the time complexity, we compute the multi-sets in an incremental manner,
and start comparing them earlier based on the partial result. In particular, we assign a
temporary label to represent each multi-set, such that two multi-sets are assigned the same
label if they are equal. This saves not only the space for storing the multi-sets, but
also the time for comparing two multi-sets afterwards. The labels are further updated
by a relabeling process as more information about the multi-sets is computed. In
the end, each distinct multi-set L_R(u) (and L_R'(v)) obtains a distinct label.

The algorithm is presented in Figure 4, where R_A is defined as the subtree of R induced
by A. Precisely, R_A is a tree constructed by contracting R to retain only those leaves with
labels in A and their least common ancestors.

The algorithm is analysed below, and we begin with two supporting lemmas. 


Phase 1. For each label i in δ(S), find a partition labeling for R_{i} and R'_{i} according
to Lemma 2.3.

Phase 2. Repeat the following procedure for log |δ(S)| rounds:

Let A_1, A_2, ... be the sets of labels considered in the last step.

1. Pair up the A_j's such that A_{2j-1} = A_{2j-1} ∪ A_{2j}.

2. Delete all A_{2j}'s and rename A_{2j-1} as A_j.

3. For each A_j, compute a partition labeling for R_{A_j} and R'_{A_j} based on the
result of the last step and Lemma 2.4.

Figure 4: Partition labeling for R and R'.
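The pairing scheme of Phase 2 can be illustrated by tracking only the label sets A_j and ignoring the trees. The function name merge_schedule is hypothetical, and the sketch assumes that an unpaired last set is simply carried over to the next round.

```python
def merge_schedule(labels):
    """Return the list of label-set groupings produced by successive rounds."""
    groups = [frozenset([x]) for x in labels]   # Phase 1: one set per label
    rounds = []
    while len(groups) > 1:
        merged = [
            groups[i] | groups[i + 1] if i + 1 < len(groups) else groups[i]
            for i in range(0, len(groups), 2)   # pair up A_{2j-1} and A_{2j}
        ]
        rounds.append(merged)
        groups = merged
    return rounds

rounds = merge_schedule("abcde")
```

With five distinct labels the sets shrink from five to three, two and one, so three rounds are needed; in general the number of rounds is logarithmic in the number of distinct labels.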


Lemma 2.1 The induced subtree R_A can be constructed in O(t) time where t is the number
of leaves in R with labels in A.

Proof. Using the algorithm in [6], with linear-time preprocessing, we can answer a least
common ancestor query in constant time. To construct R_A, we only need to answer O(t)
least common ancestor queries where t is the number of leaves in R with labels in A, so
the lemma follows. □

Lemma 2.2 Let A and B be two disjoint subsets of δ(S). Let u be an internal node in
R_{A∪B}. Then L_{R_{A∪B}}(u)|A = ∅ or L_{R_A}(v) for some v in R_A. (Similarly,
L_{R_{A∪B}}(u)|B = ∅ or L_{R_B}(v) for some v in R_B.)

Proof. If u is in R_A, L_{R_{A∪B}}(u)|A = L_{R_A}(u). It remains to consider u not in R_A. Suppose
on the contrary that L_{R_{A∪B}}(u)|A ≠ ∅ and L_{R_{A∪B}}(u)|A ≠ L_{R_A}(v) for any internal node v in
R_A. Let this u be the first one of this kind visited by a postorder traversal of R_{A∪B}.

By the construction of R_A, since u is not in R_A, u has at most one child s whose
subtree contains leaf labels in A. If such an s exists, then L_{R_{A∪B}}(u)|A = L_{R_{A∪B}}(s)|A. This
contradicts the choice of u. If u has no such child, L_{R_{A∪B}}(u)|A = ∅. Thus such a u does
not exist. A similar argument can be applied to the case of L_{R_{A∪B}}(u)|B. □

The next lemma implies that Phase 1 can be computed in O(n) time.

Lemma 2.3 Let a ∈ δ(S). A partition labeling for R_{a} and R'_{a} can be found in O(t)
time where t is the number of leaves in R with label a.

Proof. Perform a postorder traversal in R_{a}. Since L_{R_{a}}(u) only contains multiple copies
of a, we only need to keep track of |L_{R_{a}}(u)| during the traversal and assign this number to
the internal node u. Apply the same procedure to R'_{a}. □

After Phase 1, we get the partition labeling for R_{i} and R'_{i} for every label i ∈ δ(S). Phase 2
merges the R_{i}'s incrementally until we get the partition labeling for R_{δ(S)}. Below
we describe the merging process. Let A and B be two disjoint subsets of δ(S). Let p_1 and
p_2 be partition labelings for (R_A, R'_A) and (R_B, R'_B), respectively. Now, we perform the
following relabeling process on the internal nodes of R_{A∪B} and R'_{A∪B}.

Relabeling Process: Consider R_{A∪B} (similarly for R'_{A∪B}).

Step 1. Perform a postorder traversal. For each internal node u in R_{A∪B} visited, assign a
2-tuple of integers (a, b) to u in the following manner:

According to Lemma 2.2, if L_{R_{A∪B}}(u)|A = ∅, set a to 0; otherwise, there exists a v such
that L_{R_{A∪B}}(u)|A = L_{R_A}(v), then set a to p_1(v). Set b similarly by considering
L_{R_{A∪B}}(u)|B and p_2.

Step 2. Sort all 2-tuples of internal nodes by radix sort. Traverse the sorted list of these
2-tuples, and assign a new integer (starting from 1) to every distinct 2-tuple encountered.
Assign this integer as a label to the corresponding internal node.
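Step 2 can be sketched as follows, assuming Step 1 has already produced one 2-tuple per internal node of each tree. The function name relabel is hypothetical, and the sketch swaps in Python's comparison-based sorted() for the radix sort of the text, so it illustrates the labeling rule but not the linear running time. It also assumes that the tuples of both trees are numbered together, so that equal tuples in R_{A∪B} and R'_{A∪B} receive the same integer.

```python
def relabel(pairs_r, pairs_r_prime):
    """Map every distinct 2-tuple to a new integer, starting from 1."""
    distinct_pairs = sorted(set(pairs_r) | set(pairs_r_prime))
    new_id = {pair: i for i, pair in enumerate(distinct_pairs, start=1)}
    return ([new_id[p] for p in pairs_r],
            [new_id[p] for p in pairs_r_prime])

# Hypothetical 2-tuples (a, b) for three internal nodes in each tree.
pairs_r = [(1, 0), (0, 2), (1, 2)]
pairs_r_prime = [(0, 2), (1, 2), (2, 0)]
new_r, new_r_prime = relabel(pairs_r, pairs_r_prime)
```

After relabeling, two nodes carry the same integer exactly when their 2-tuples agree, which is the property used in the proof of Lemma 2.4.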

In fact, the labels assigned to the nodes in the relabeling process form a valid partition 
labeling. We have the following lemma. 

Lemma 2.4 Given the partition labelings for (R_A, R'_A) and (R_B, R'_B), we can compute a
partition labeling for R_{A∪B} and R'_{A∪B} in O(t) time where t is the number of leaves in R_{A∪B}.

Proof. Perform the relabeling process on R_{A∪B} and R'_{A∪B}. Since L_{R_{A∪B}}(u) =
L_{R_{A∪B}}(u)|A ∪ L_{R_{A∪B}}(u)|B, for any internal nodes x in R_{A∪B} and y in R'_{A∪B},
L_{R_{A∪B}}(x) = L_{R'_{A∪B}}(y) if and only if the corresponding 2-tuples assigned to
x and y are identical. So, the labels assigned to the nodes in Step 2 form a valid partition
labeling.

Regarding the time complexity, in Step 1, during the postorder traversal, for each
internal node u, if u is in R_A, then v = u. Otherwise, if there exists a child s of u in R_{A∪B}
with L_{R_{A∪B}}(s)|A = L_{R_A}(t), then v = t. If no such s exists, L_{R_{A∪B}}(u)|A = ∅. So, this step
can be completed in linear time. Obviously, Step 2 can also be completed in linear time,
so the lemma follows. □

Lemma 2.4 implies that each round of Phase 2 takes O(n) time. This gives the following
lemma.

Lemma 2.5 Given R and R', a partition labeling for R and R' can be computed in
O(n log n) time where n is the number of leaves in R.

Proof. By Lemma 2.3, Phase 1 takes O(n) time. By Lemma 2.4, each round in Phase 2
takes O(n) time, so the overall complexity is O(n log |δ(S)|) = O(n log n) time. Thus, the
lemma follows. □

2.2 An O(n log n)-time algorithm for finding non-shared edges

In this section we show that the non-shared edge problem for weighted phylogenies can
be solved by an O(n log n)-time reduction to the partition labeling problem on two rooted
trees.

Let T and T' be two weighted phylogenies with the same set of distinct leaf labels and
the same multi-set of edge weights and internal node degrees. Recall that a shared edge is
defined as follows.

An edge e in T is said to be shared (with respect to T') if there exists an edge e' in T' 
such that e and e' induce the same partition of leaf labels, internal node degrees, and edge 
weights in T and T', respectively; otherwise, e is non-shared. 

The non-shared edge problem is to find all non-shared edges in T (with respect to
T') and all non-shared edges in T' (with respect to T). Figure 5 gives the details of the
reduction from the non-shared edge problem to the partition labeling problem. Basically,
edge weights and node degrees in T and T' will be represented by new labeled leaves in the
two constructed trees R and R', respectively.


1. Set R = T and R' = T'. 

2. Fix an arbitrary leaf with label a. Root R and R' at the internal nodes which 
attach to leaves with label a. 

3. Attach a new leaf to every internal node of R and R' such that the labels of 
such new leaves are the same if the corresponding internal nodes have the same 
degree. 

4. Attach a new leaf in the middle of every internal edge in both R and R' such
that the labels of such new leaves are the same if the original edges have the
same weight.

Figure 5: Construction of R and R' from T and T'. 
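The construction of Figure 5 can be sketched for a single tree as follows. The encoding is an assumption made for this illustration: T is an adjacency dictionary {node: {neighbor: weight}} in which leaves are strings and internal nodes are integers, the output is a nested tuple whose leaves are strings, and the new leaves are given the hypothetical labels "deg:k" and "w:x" (assumed not to clash with the original leaf labels).

```python
def build_rooted(adj, root_label):
    """Build R from T: root at the neighbor of the leaf root_label (Step 2),
    add a degree leaf to every internal node (Step 3) and a weight leaf in
    the middle of every internal edge (Step 4)."""

    def subtree(v, parent):
        if isinstance(v, str):                     # original leaf of T
            return v
        children = ["deg:%d" % len(adj[v])]        # Step 3: degree of v in T
        for nb, weight in adj[v].items():
            if nb == parent:
                continue
            child = subtree(nb, v)
            if not isinstance(nb, str):            # Step 4: (v, nb) is internal
                # New middle node s with two children: weight leaf and subtree.
                child = ("w:%s" % weight, child)
            children.append(child)
        return tuple(children)

    (root,) = adj[root_label]                      # the unique neighbor of the leaf
    return subtree(root, None)

# A small weighted tree: leaves a, b at node 0; leaves c, d at node 1.
T = {
    "a": {0: 1}, "b": {0: 1}, "c": {1: 1}, "d": {1: 1},
    0: {"a": 1, "b": 1, 1: 5},
    1: {0: 5, "c": 1, "d": 1},
}
R = build_rooted(T, "a")
```

The middle node created for the internal edge (0, 1) plays the role of the node s in Lemma 2.7: its leaf multi-set records the weight of the edge together with the labels and degrees on the side away from the root.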

Note that R and R' have the same multi-set of leaf labels since T and T' have the same
multi-set of leaf labels, edge weights, and node degrees. Moreover, the number of leaves in R
and R' is O(n), where n is the number of leaves in T.

Lemma 2.6 The construction of R and R' takes O(n log n) time.

Proof. The lemma follows since Step 4 takes at most O(n log n) time, Step 3 takes O(n)
time, and Steps 1 and 2 take O(1) time. □

The following lemma relates the non-shared edges problem and the partition labeling 
problem. 

Lemma 2.7 Given a partition labeling p for R and R', let (u, v) be an edge in T and s be
the unique internal node between u and v in R. Then, the edge (u, v) is a non-shared edge
in T (w.r.t. T') if and only if the label p(s) is unique in p.

Proof. Suppose that (u, v) is a non-shared edge in T. By the construction of R and R',
L_R(s) must be unique. So, p(s) is unique in p.

On the other hand, if (u, v) is a shared edge in T, then there is an edge (u', v')
in T' such that (u, v) and (u', v') induce the same partition of leaf labels, node degrees,
and edge weights in T and T', respectively. Without loss of generality, let u and u' be
the endpoints on the side containing the leaf with label a. Then, u and u' are the ancestors
of v and v' in R and R', respectively. Let s' be the unique internal node between u' and v'
in R'. By the construction of R and R', L_R(s) = L_R'(s'), so p(s) = p(s') and p(s) is not
unique in p. □

In conclusion, we have the following theorem. 

Theorem 2.8 The non-shared edges of T and T' can be identified in O(n log n) time.

Proof. By Lemma 2.7, if a partition labeling for R and R' is given, we can identify all
non-shared edges in T and T' in O(n) time. Since the partition labeling problem can be
solved in O(n log n) time, the theorem follows. □

3 The STT distance between degree-d phylogenies

This section studies the problem of computing the STT distance between two degree-d
phylogenies in both weighted and unweighted cases. For the weighted case, we show that the
problem of computing the STT distance between two weighted degree-d phylogenies (the
weighted STT-d problem) is equivalent to the problem of computing the restricted-STT
distance between two weighted degree-3 phylogenies (the weighted rSTT-3 problem). Since
DasGupta et al. [2] have shown that the weighted rSTT-3 problem is NP-hard and can be
approximated within a factor of 2, the same results apply to the weighted STT-d problem.
For the unweighted case, we give a new approximation algorithm with an approximation
factor of 2d - 4 for finding the STT distance between two degree-d phylogenies (the
unweighted STT-d problem). We also prove that the problem of computing the STT distance
between two unweighted phylogenies with possibly non-distinct labels is NP-hard.

Section 3.1 gives the notation and definitions used in this section. Section 3.2 gives the
result for the weighted case. Section 3.3 gives an approximation algorithm for computing
the STT distance between two unweighted phylogenies. Section 4 shows that it is NP-hard
to compute the STT distance between two unweighted phylogenies with possibly non-distinct
labels.

3.1 Preliminaries 

Recall the definitions of the STT operation, the restricted-STT operation, the STT distance,
and the restricted-STT distance. Given a tree T (rooted or unrooted), a subtree transfer
(STT) operation is defined as follows. We select a subtree S from T. Suppose S is attached
to a node u by an edge e. Pick another edge e' = (v, w) (or an internal node t) not in S.
Detach e and S and re-attach them to a newly created node x in e' (or to t). If u becomes
degree 2 after removing S, merge the two edges connected to u into one. In the weighted
version, let w(e) denote the weight of an edge e; we require that if a new node x is created,
then w((v, x)) + w((x, w)) = w(e'); furthermore, if u is removed, the weight of the merged
edge is the total weight of the two merging edges.

An STT operation is called restricted if S is always re-attached to a new node inside
an edge. An STT operation is charged by how far the subtree is transferred. Precisely, the
cost of an STT operation is defined as the number of edges or the total weight of edges,
for the unweighted and weighted versions, respectively, on the shortest path from u to x (or t).
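The cost model can be illustrated by measuring the path between two nodes of a given tree. The sketch below is an assumption-laden illustration rather than part of the paper: the tree is an adjacency dictionary {node: {neighbor: weight}}, path_cost is a hypothetical helper, and the point x is assumed to be an existing node of the tree (a node created inside an edge would first have to be added with its two partial weights).

```python
def path_cost(adj, u, x, weighted=False):
    """Cost of the unique tree path from u to x: number of edges in the
    unweighted case, total edge weight in the weighted case."""
    stack = [(u, None, 0)]
    while stack:
        node, parent, cost = stack.pop()
        if node == x:
            return cost
        for nb, w in adj[node].items():
            if nb != parent:                       # never walk back up the path
                stack.append((nb, node, cost + (w if weighted else 1)))
    raise ValueError("x is not reachable from u")

T = {
    "a": {0: 1}, "b": {0: 1}, "c": {1: 1}, "d": {1: 1},
    0: {"a": 1, "b": 1, 1: 5},
    1: {0: 5, "c": 1, "d": 1},
}
unweighted_cost = path_cost(T, 0, 1)               # one edge
weighted_cost = path_cost(T, 0, 1, weighted=True)  # weight of edge (0, 1)
```

Because a tree has a unique simple path between any two nodes, a plain traversal suffices and no shortest-path algorithm is needed.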

The STT distance between two trees T_1 and T_2, denoted by STTdist(T_1, T_2), is defined
as the minimum cost of transforming T_1 to a tree which is leaf-label preserved isomorphic
to T_2 using STT operations. The restricted-STT distance, denoted by rSTTdist(T_1, T_2), is
defined similarly by allowing only restricted-STT operations.

Note that STTdist(T_1, T_2) = STTdist(T_2, T_1). However, the corresponding equality may
not hold for the restricted-STT distance. For example, consider the case that T_1 is a degree-4
tree while T_2 is a degree-3 tree. It is possible to transform T_1 to T_2 using only restricted-STT
operations, but not vice versa.

3.2 Weighted degree-d phylogenies 

This section shows that the problem of finding the STT distance between two weighted
degree-d phylogenies (the weighted STT-d problem) is equivalent to the problem of finding
the restricted-STT distance between two weighted degree-3 phylogenies (the weighted
rSTT-3 problem). We first show that the weighted STT-d problem can be reduced to
the weighted rSTT-3 problem. Given a weighted degree-d phylogeny X, we construct a
degree-3 phylogeny T from X as follows.

Transformation from a degree-k phylogeny to a degree-3 phylogeny: For each
node u of X with degree k > 3, let the edges that are attached to u be e_0, e_1, ..., e_{k-1}. Pick
one of the edges, say e_0 = (u, v). Create k - 3 new nodes y_1, y_2, ..., y_{k-3} on e_0 such that y_1
is adjacent to u, y_i is adjacent to y_{i-1} for 2 ≤ i ≤ k - 3, and w((y_1, u)) = 0, w((y_i, y_{i-1})) = 0
for 2 ≤ i ≤ k - 3, w((v, y_{k-3})) = w(e_0). Detach e_i and the corresponding subtree, and
reattach them to node y_i for 1 ≤ i ≤ k - 3. See Figure 6 for an example.
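A minimal sketch of this transformation, under the assumption that the tree is stored as an adjacency dictionary {node: {neighbor: weight}}. The function name to_degree3 and the fresh node names ("y", i) are hypothetical; which edge plays the role of e_0 is simply the first one in dictionary order.

```python
import itertools

def to_degree3(adj):
    """Return a copy of the tree in which every node of degree k > 3 is
    expanded into a zero-weight path carrying the detached subtrees."""
    adj = {v: dict(nbrs) for v, nbrs in adj.items()}    # work on a copy
    fresh = itertools.count()
    for u in list(adj):
        k = len(adj[u])
        if k <= 3:
            continue
        nbrs = list(adj[u].items())                     # e_0, e_1, ..., e_{k-1}
        v, w0 = nbrs[0]                                 # e_0 = (u, v)
        del adj[u][v]
        del adj[v][u]
        prev = u
        for i in range(1, k - 2):                       # i = 1, ..., k - 3
            y = ("y", next(fresh))                      # new node y_i on e_0
            adj[y] = {prev: 0}
            adj[prev][y] = 0                            # zero-weight path edge
            nb, w = nbrs[i]                             # move e_i from u to y_i
            del adj[u][nb]
            del adj[nb][u]
            adj[y][nb] = w
            adj[nb][y] = w
            prev = y
        adj[prev][v] = w0                               # (v, y_{k-3}) keeps w(e_0)
        adj[v][prev] = w0
    return adj

# A star with a degree-5 center and edge weights 1..5.
star = {0: {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5},
        "a": {0: 1}, "b": {0: 2}, "c": {0: 3}, "d": {0: 4}, "e": {0: 5}}
t3 = to_degree3(star)
```

In the example the center keeps two of its original edges plus the edge to y_1, so every internal node of the result has degree 3 while the total edge weight is unchanged.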
Let T be the resulting tree after the transformation. Note that T is a degree-3 phylogeny
and the transformation only uses restricted-STT operations of zero cost. We call T or any 
tree which can be generated by the transformation a degree-3 representation of X. Based 
on the above construction, we have the following facts. 

Fact 3.1 Let X be a weighted degree-d phylogeny. Let T and T' be two degree-3
representations of X. Then,

• rSTTdist(X, T) = 0 and STTdist(T, X) = 0, and

• rSTTdist(T, T') = 0.

Also, STT operations on X can be "simulated" by a sequence of restricted-STT operations
on its degree-3 representation T with the same cost. More precisely, we have the
following lemma.

Lemma 3.2 Let X be a weighted degree-d phylogeny and X' be the phylogeny constructed
from X by one STT operation with cost c. Let T and T' be the degree-3 representations of X
and X', respectively. Then, T can be transformed to T' using a number of restricted-STT
operations with total cost c.

Proof. For each edge e in X, there is a corresponding edge e' in T such that they induce 
the same bipartition of leaf labels. And if the nodes of X are given unique labels, then, for 
each node u in X, there is a corresponding node u' in T with the same label as u. Now, 
we simulate an STT operation on X in T as follows. 

If an STT operation moves a subtree S (in X) which is attached to u by an edge e_1 from
u to an edge e_2, we simulate the operation in T by moving the edge e_1' and its attached
subtree to the edge e_2'. On the other hand, if an STT operation moves the subtree S and
e_1 from u to an internal node v, there are two cases. If v is of degree 3, then let e be any
edge attached to v' in T, and move e_1' and its subtree to e, forming a new node x on e
with w((x, v')) = 0. Otherwise, v must be of degree k > 3, and there must be an edge e
attached to v' (in T) such that e induces a bipartition of leaf labels that cannot be induced
by any edge attached to v. Intuitively, this is the new edge added when we transform v
from degree k to degree 3. Then, we move e_1' and its subtree to e. It can be shown that T'
is a degree-3 representation of X' and rSTTdist(T, T') = c. □

Now, we show that STTdist(X_1, X_2) = rSTTdist(T_1, T_2).

Lemma 3.3 Let X_1 and X_2 be two weighted degree-d phylogenies, where d ≥ 3. Let T_1
and T_2 be degree-3 representations of X_1 and X_2, respectively. Then, STTdist(X_1, X_2) =
rSTTdist(T_1, T_2).

Proof. To transform X_1 into X_2 using STT operations, we can first transform X_1 to
T_1, then transform T_1 to T_2 using restricted-STT operations, and finally transform T_2
back to X_2. In other words, STTdist(X_1, X_2) ≤ rSTTdist(X_1, T_1) + rSTTdist(T_1, T_2) +
STTdist(T_2, X_2). By Fact 3.1, rSTTdist(X_1, T_1) = 0 and STTdist(T_2, X_2) = 0. So, we
have STTdist(X_1, X_2) ≤ rSTTdist(T_1, T_2).

On the other hand, by Lemma 3.2, to transform T_1 into T_2 using restricted-STT operations,
we can simulate the transformation from X_1 to X_2 on T_1 to obtain T', where T' is a
degree-3 representation of X_2. Then, we transform T' to T_2. By Fact 3.1, rSTTdist(T', T_2) =
0. So, rSTTdist(T_1, T_2) ≤ STTdist(X_1, X_2). The lemma follows. □

By Lemma 3.3, we have the following theorem. 

Theorem 3.4 The weighted STT-d problem is equivalent to the weighted rSTT-3 problem. 

Proof. Consider two weighted degree-d trees X_1 and X_2. Let R_1 and R_2 be the
degree-3 representations of X_1 and X_2, respectively. By Lemma 3.3, STTdist(X_1, X_2) =
rSTTdist(R_1, R_2). Thus, the weighted STT-d problem can be reduced to the weighted
rSTT-3 problem.

Given two weighted degree-3 trees T_1 and T_2, T_1 and T_2 are their own degree-3
representations. By Lemma 3.3, rSTTdist(T_1, T_2) = STTdist(T_1, T_2). Hence, the
weighted rSTT-3 problem can be reduced to the weighted STT-d problem. □

3.3 An approximation algorithm for unweighted STT distance 

This section gives an approximation algorithm for computing the STT distance between
two unweighted phylogenies (the unweighted STT problem). The approximation factor is
2d - 4, which is independent of the number of leaves, n.

Given two phylogenies, we first define what a non-leaf-label-shared edge is, and give a
lower bound on the STT distance between the phylogenies based on the number of
non-leaf-label-shared edges in the phylogenies.

Let T and T' be two phylogenies with the same set of leaf labels. An edge e in T is 
said to be leaf-label-shared (w.r.t. T'), if for some edge e' in T', e and e' induce the same 
partition of leaf labels; otherwise, e is said to be non-leaf-label-shared. 
Lemma 3.5 Let T and T' be two degree-d phylogenies with the same set of leaf labels. Let
b and b' denote the number of non-leaf-label-shared edges in T and T', respectively. Then
STTdist(T, T') ≥ max(b, b').

Proof. By viewing an edge as a partition of leaves, a sequence of STT operations with
cost k can create at most k new edges and delete at most k edges. To transform T to T',
we must either delete or create at least max(b, b') edges, because any non-leaf-label-shared
edge of one tree is not contained in the other. Thus, the STT distance is at least max(b, b').
□
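The lower bound of Lemma 3.5 can be computed directly from the definition, as in the following sketch. It is a simple quadratic illustration and not Day's linear-time algorithm [4]; the adjacency-list encoding (leaves are strings, internal nodes are integers) and both function names are assumptions made for this example.

```python
def edge_bipartitions(adj):
    """Return, for every edge of an unrooted tree, the bipartition of the
    leaf labels obtained by removing that edge."""
    leaves = frozenset(v for v in adj if isinstance(v, str))
    parts = []

    def below(v, parent):
        labels = {v} if isinstance(v, str) else set()
        for nb in adj[v]:
            if nb != parent:
                side = below(nb, v)                 # leaves on the far side of (v, nb)
                parts.append(frozenset([side, leaves - side]))
                labels |= side
        return frozenset(labels)

    below(next(iter(adj)), None)                    # root the traversal anywhere
    return parts

def stt_lower_bound(adj_t, adj_t_prime):
    """max(b, b'), where b and b' count the non-leaf-label-shared edges."""
    p, q = edge_bipartitions(adj_t), edge_bipartitions(adj_t_prime)
    p_set, q_set = set(p), set(q)
    b = sum(1 for part in p if part not in q_set)
    b_prime = sum(1 for part in q if part not in p_set)
    return max(b, b_prime)

# T groups (a, b) against (c, d); T' groups (a, c) against (b, d).
T = {"a": [0], "b": [0], "c": [1], "d": [1], 0: ["a", "b", 1], 1: [0, "c", "d"]}
T_prime = {"a": [0], "c": [0], "b": [1], "d": [1], 0: ["a", "c", 1], 1: [0, "b", "d"]}
bound = stt_lower_bound(T, T_prime)
```

In this example every leaf edge is shared and only the single internal edge of each tree is not, so b = b' = 1.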

Note that STT operations are reversible in the sense that if a tree T_1 can be transformed
to T_2 using a sequence σ of STT operations, we can easily transform T_2 to T_1 by reversing
the operations in σ with the same cost. Based on the following lemma and the reversibility
of STT operations, we will derive a linear-time approximation algorithm for computing the
STT distance between two phylogenies.

Lemma 3.6 (Theorem 4, [14]) Let T and T' be two unweighted degree-d phylogenies with
the same set of leaf labels. If neither of them contains non-leaf-label-shared edges, then T
and T' are isomorphic.

Figure 7 details the approximation algorithm. The basic idea is to transform each 
phylogeny to one without non-leaf-label-shared edges using STT operations. 


1. Identify non-leaf-label-shared edges in T and T'. 

2. Transform T to T_s by "contracting" all non-leaf-label-shared edges using a
sequence σ_1 of STT operations as follows:

2.1. Partition the set of non-leaf-label-shared edges into groups such that if two
edges are connected in T, they are in the same group.

2.2. For each group, pick an internal node x which attaches to one of the edges.
For each non-leaf-label-shared edge (x, y), let k be the degree of y. By STT
operations, we detach (k - 2) subtrees from y and re-attach them to x. Then,
y becomes degree-2 and disappears. Repeat until all non-leaf-label-shared
edges in the group are removed.

3. Repeat Step 2 to transform T' to T'_s using a sequence σ_2 of STT operations.

4. Output the total cost c of all STT operations in σ_1 and σ_2.

Figure 7: An approximation algorithm for computing STT distance. 

The following lemma analyses the approximation factor and the time complexity of the 
algorithm. 
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Lemma 3.7 Let T and T' be two degree-d phytogenies with the same set of leaf labels. 
Then, STTdist(T, T') ean be approximated within a faetor of 2d —4 in 0{n) time. 

Proof. We apply the approximation algorithm presented in Figure 7 to T and T'. Since
the leaf labels are distinct in T (T'), Step 1 can be done in $O(n)$ time [4]. All other steps
can be completed in $O(n)$ time, so the overall time complexity is $O(n)$.

By Lemma 3.6, $T_g$ and $T'_g$ are isomorphic. Thus, by using the sequence of STT operations
in $\sigma_1$ and reversing the operations in $\sigma_2$, we can transform T to T' with cost c.

To determine the approximation factor, it remains to bound the value of c. Note that
all the STT operations are performed in Step 2, where we remove (contract) each of the
non-leaf-label-shared edges. The cost of STT operations for each removal is at most $d - 2$.
Thus, $c \le b(d - 2) + b'(d - 2)$, where b and b' are the numbers of non-leaf-label-shared edges
in T and T', respectively. By Lemma 3.5, $c \le (2d - 4) \cdot \max(b, b') \le (2d - 4)\,\mathrm{STTdist}(T, T')$,
so the lemma follows. □
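For reference, the chain of inequalities used in the proof can be written out in full. The only step not spelled out in the text is $b + b' \le 2\max(b, b')$; everything else restates the proof.

```latex
\begin{align*}
c &\le b(d-2) + b'(d-2) \\
  &=   (d-2)\,(b + b') \\
  &\le (d-2)\cdot 2\max(b, b') \\
  &=   (2d-4)\,\max(b, b') \\
  &\le (2d-4)\,\mathrm{STTdist}(T, T') \qquad \text{(by Lemma 3.5)}.
\end{align*}
```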

4 Unweighted degree-d phylogenies and NP-hardness

This section studies the computational complexity of computing the STT distance
between two unweighted degree-d phylogenies. We prove the NP-hardness of a slightly more
general problem: computing the STT distance between two unweighted trees whose leaves
are labeled by possibly non-distinct labels. Our result
also implies the NP-hardness of the weighted version of this problem, which was first proven
in [2].

We consider the following decision problem. Given two degree-d unrooted trees T and
T' and an integer t, the problem is to determine whether $\mathrm{STTdist}(T, T') \le t$. We show that
this decision problem is NP-hard by reducing the Exact Cover by 3-Sets (X3C) problem
[5] to it. The X3C problem is defined as follows. Given a set $S = \{s_1, s_2, \ldots, s_{3q}\}$ for some
integer q and a collection $C = C_1, C_2, \ldots, C_n$ of subsets of S where $C_i = \{s_{i_1}, s_{i_2}, s_{i_3}\}$ and
$\bigcup_{i=1}^{n} C_i = S$, determine whether C contains an exact cover for S, that is, a sub-collection
$\mathcal{D} = D_1, D_2, \ldots, D_q$ of C such that $\bigcup_{i=1}^{q} D_i = S$. Given an instance of the X3C problem, we
construct two degree-d trees T and T' where $d = 4n - q$ as follows.
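To make the X3C definition concrete, here is a minimal brute-force sketch. It is exponential in the worst case and is meant only to illustrate the problem statement, not the reduction; the function name `exact_cover_by_3_sets` and the input encoding (any iterables of hashable elements) are assumptions made for this example.

```python
from itertools import combinations

def exact_cover_by_3_sets(S, C):
    """Return q triples from C whose union is S, or None if none exist.

    S: iterable of 3q distinct elements
    C: list of 3-element subsets of S
    """
    S = frozenset(S)
    if len(S) % 3 != 0:
        return None
    q = len(S) // 3
    for D in combinations(C, q):
        # q triples that together contain all 3q elements are necessarily
        # pairwise disjoint, so checking the union is enough.
        if frozenset().union(*D) == S:
            return list(D)
    return None
```

For example, with S = {1, ..., 6} and C = [{1, 2, 3}, {3, 4, 5}, {4, 5, 6}], the sub-collection {1, 2, 3}, {4, 5, 6} is an exact cover.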

Construction of T: For each $C_i = \{s_{i_1}, s_{i_2}, s_{i_3}\}$, we construct a subtree with three leaves
labeled as $s_{i_1}$, $s_{i_2}$, $s_{i_3}$, respectively (see Figure 8(a)). In T, each of these subtrees is attached
to a long arm where a long arm is made up of three short arms. Each short arm is a path
with leaves labeled as $x_0, x_1, \ldots, x_{n^2-1}$ hanging on the path (see Figure 8(b)). In other
words, besides the $3q$ $s_i$ labels, we create $n^2$ unique $x_i$ labels. All long arms meet at a single
node M (see Figure 8(c)).

Construction of T': We construct $3q$ leaves with labels $s_1, \ldots, s_{3q}$, respectively. Each of
these leaves is attached to a short arm as shown in Figure 8(d). The short arms meet at
a node N. Besides the short arms, there are $n - q$ long arms attached to N. To make T'
have the same multi-set of leaf labels as T, we create additional leaves with appropriate
labels and attach these leaves to N directly.

Figure 8: Trees T and T'.

Let $t = 3n^3 + 6n$. We show that $\mathrm{STTdist}(T, T') \le t$ if and only if S has an exact cover.
If S has an exact cover, there exist q long arms in T such that the set of all $s_i$ leaves at
the end of these long arms is equal to S. To transform T to T', basically, each of these long
arms is transformed into 3 short arms with one $s_i$ leaf attached at the end. For the other
long arms of T, we move up all $s_i$ leaves to M. The whole procedure can be done using
STT operations of cost at most t. The detailed steps are given below (cf. [2]). Although T
is an unrooted tree, for ease of description, if an STT operation moves a subtree towards M,
we say that it moves the subtree up the tree; otherwise, we say that it moves the subtree
down the tree.

Without loss of generality, let $\{C_1, C_2, \ldots, C_q\}$ be an exact cover of S. There are two
cases.

Case 1. For the long arm corresponding to $C_i$ where $1 \le i \le q$, let $C_i = \{s_p, s_q, s_r\}$. We
first move the two leaves with labels $s_p$ and $s_q$ up $n^2 + 1$ edges (see Figure 9(a)) using
STT operations of cost $n^2 + 2$. Note that the subtree R (see Figure 9(a)) will be a
short arm in T'. The next step is to move the leaf with label $s_q$ and R up $n^2 + 1$
edges (see Figure 9(b)) using STT operations of cost $n^2 + 2$. Note that the subtree
P will be another short arm in T'. The last step is to move both P and R to M and
make them attach to M directly. This requires STT operations of cost $n^2 + 2$. The
total cost for each long arm is $3n^2 + 6$. So, the total cost to transform these long
arms into short arms is $3n^2 q + 6q$.

Case 2. For each of the remaining long arms, move the subtree containing the three $s_i$
leaves up to M using STT operations of cost $3n^2$, then use two more STT operations
to make each $s_i$ leaf attached to M directly. The total cost of STT operations for
these long arms is $(3n^2 + 2)(n - q)$.



Figure 9: Transforming a long arm in T to three short arms in T'. 

Thus, we can transform T to T' using STT operations with total cost at most $3n^3 + 6n$.
The correctness of the if-part is established. In the following, we focus on showing the
correctness of the only-if part. That is, we show that if $\mathrm{STTdist}(T, T') \le 3n^3 + 6n$, then
there is an exact cover of S.
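The total in the if-part can be checked by adding the two case costs. The sketch below assumes the reading of the construction in which each short arm carries $n^2$ leaves, so that Case 1 costs $3(n^2 + 2)$ per covering arm and Case 2 costs $3n^2 + 2$ per remaining arm; the function name `if_part_cost` is hypothetical. The sum simplifies to $3n^3 + 2n + 4q$, which is at most $3n^3 + 6n$ because $q \le n$.

```python
def if_part_cost(n, q):
    """Total STT cost of the if-part transformation for n long arms, q of them covering."""
    case1 = q * 3 * (n**2 + 2)          # three moves of cost n^2 + 2 per covering arm
    case2 = (n - q) * (3 * n**2 + 2)    # one move of cost 3n^2 plus two unit moves
    return case1 + case2

# Check the closed form and the bound t = 3n^3 + 6n for small instances.
for n in range(1, 20):
    for q in range(0, n + 1):
        assert if_part_cost(n, q) == 3 * n**3 + 2 * n + 4 * q
        assert if_part_cost(n, q) <= 3 * n**3 + 6 * n
```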

Observation 4.1 For any STT operation of cost k, we can decompose it into k STT 
operations of which each has cost one. 

From this point onwards, we regard each STT operation as of unit cost unless otherwise
stated. In other words, each STT operation will move a subtree together with the edge
attached to it from one end of an edge to the middle or the other end of the same edge.
To prove the only-if part, for a given sequence of STT operations that transform T to T',
we identify a set of effective STT operations. We show that each of these operations can
be characterized by a unique edge (called an upward edge) in T. If the total number of
STT operations is small (i.e., $\le 3n^3 + 6n$), then the number of effective operations must
be large, that is, there must be a lot of upward edges and this implies the existence of an
exact cover for S. We first give some preliminary definitions and concepts in Section 4.1.
We then give a lower bound on the number of upward edges in Section 4.2. The proof of
the only-if part is given in Section 4.3.

4.1 Definitions and concepts

Let F be a given sequence of STT operations of total cost at most $3n^3 + 6n$ that transforms
T to T'. By tracing the operations, there is a one-to-one correspondence of the leaves in T
to those in T'. We can relabel the leaves of T by giving an extra index to the leaves with
the same label. For example, we can use $x_{0,1}, x_{0,2}, \ldots$ to distinguish leaves with label $x_0$. The
leaves in T' can be relabeled according to the new labels of the corresponding leaves in
T. After the relabeling, we can regard leaves in T (or T') as having distinct labels. Since
the index j in $x_{i,j}$ is not important, we will refer to any particular $x_{i,j}$ simply by $x_i$ in
the following. Similarly for $s_i$ leaves, we relabel them and refer to them using the same
approach. In other words, we can now assume that all leaf labels are distinct with respect
to F.

Each edge e in T induces a bipartition of leaves, denoted as $b_e$. After the relabeling,
these bipartitions are unique. Let $B = \{b_1, b_2, \ldots, b_{3n^3+n}\}$ and $B' = \{b'_1, b'_2, \ldots, b'_{3n^3-n+q}\}$
be the sets of bipartitions induced by the internal edges in T and T', respectively. Note
that $B \cap B' = \emptyset$.
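As an illustration of how the sets B and B' could be computed and compared on small inputs, the following sketch enumerates the bipartitions induced by internal edges. It assumes a hypothetical adjacency-dictionary encoding in which leaf nodes are named by their (distinct, mutually comparable) labels, and it represents each bipartition by the side that does not contain a fixed reference leaf. It runs in quadratic time and is not the algorithm used in the paper.

```python
def internal_bipartitions(adj):
    """Return the set of leaf bipartitions induced by internal edges of a tree.

    adj: dict mapping each node to the set of its neighbours; leaves have degree 1.
    Each bipartition is stored as the frozenset of leaves on the side that
    does not contain the smallest leaf label.
    """
    leaves = frozenset(u for u, vs in adj.items() if len(vs) == 1)
    ref = min(leaves)                      # fixed reference leaf for a canonical side

    def leaves_beyond(u, parent):
        # Collect the leaves reachable from u without crossing the edge (parent, u).
        seen, stack, found = {parent, u}, [u], set()
        while stack:
            w = stack.pop()
            if w in leaves:
                found.add(w)
            for z in adj[w] - seen:
                seen.add(z)
                stack.append(z)
        return frozenset(found)

    result = set()
    for u, vs in adj.items():
        for v in vs:
            # Visit each internal edge once (u < v) and skip edges touching a leaf.
            if u < v and u not in leaves and v not in leaves:
                side = leaves_beyond(v, u)
                result.add(leaves - side if ref in side else side)
    return result
```

Two trees on the same leaf set share no internal-edge bipartition exactly when the intersection of the two returned sets is empty.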

Let $T_i$ be the resulting tree after performing the $i$-th STT operation in F. Let $B^i$ be
the set of bipartitions induced by the internal edges in $T_i$. For any i, since the leaves are
uniquely relabeled, the bipartitions in $B^i$ are unique. Let op be an STT operation in F
that transforms $T_i$ to $T_{i+1}$. After the operation, we either (1) delete a bipartition $b \in B^i$
and create a new bipartition $b' \in B^{i+1}$; or (2) create a new bipartition without deleting
any bipartition in $B^i$; or (3) delete a bipartition in $B^i$; or (4) do not change the set of
bipartitions. See Figure 10 for all these STT operations.




Figure 10: Possible STT operations. 

An STT operation op is called an effective operation if op deletes a bipartition $b \in B^i$
and creates a new bipartition $b' \in B^{i+1}$ (Case 1) such that $b \in B$, $b' \in B'$ and no subsequent
STT operations in F delete b'. The unique edge e in T which induces b is called an effective
edge and we say that e maps to $\hat{e}$ if $\hat{e}$ induces b' in T'. Finally, an STT operation which is not
effective is called an ineffective operation.

Note that each effective operation op can be characterized by the pair of edges $(e, \hat{e})$. The
following lemma gives the bounds for the numbers of effective and ineffective operations.

Lemma 4.2 If $\mathrm{STTdist}(T, T') \le 3n^3 + 6n$, then (i) the number of effective operations is
at least $3n^3 - 6n$ and (ii) the number of ineffective operations is at most $12n$.

Proof. Since $B \cap B' = \emptyset$, there must be STT operations that delete all bipartitions in B
and create the bipartitions in B'. $|B| = 3n^3 + n$ and $|B'| = 3n^3 - n + q \ge 3n^3 - n$. Since the
total number of STT operations is at most $3n^3 + 6n$, the number of effective operations,
each of which deletes a bipartition in B and creates a bipartition in B', must be at least
$(3n^3 + n) + (3n^3 - n) - (3n^3 + 6n) = 3n^3 - 6n$. The number of ineffective operations must
be at most $(3n^3 + 6n) - (3n^3 - 6n) = 12n$. □
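The two bounds in Lemma 4.2 are plain arithmetic on $|B|$, the lower bound on $|B'|$, and the operation budget. A small sketch, with the hypothetical function name `lemma_4_2_bounds`, reproduces the numbers used in the proof for any n:

```python
def lemma_4_2_bounds(n):
    """Return (lower bound on effective ops, upper bound on ineffective ops)."""
    size_B = 3 * n**3 + n                  # |B|
    size_B_prime_lb = 3 * n**3 - n         # |B'| = 3n^3 - n + q >= 3n^3 - n
    budget = 3 * n**3 + 6 * n              # assumed bound on the number of operations
    effective_lb = size_B + size_B_prime_lb - budget
    ineffective_ub = budget - effective_lb
    return effective_lb, ineffective_ub

# The bounds match the closed forms 3n^3 - 6n and 12n stated in the lemma.
for n in range(1, 50):
    assert lemma_4_2_bounds(n) == (3 * n**3 - 6 * n, 12 * n)
```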

To help the analysis, we further classify effective edges in T. Let e be an effective edge. 
Let op be the effective operation that deletes the bipartition induced by e. If op moves a 
subtree up the tree, e is called an upward effective edge. Otherwise, it is called a downward 
effective edge. 

The number of upward effective edges is critical to the proof of the only-if part. In the
next section, we will show that the number of upward effective edges will be large if the
total number of STT operations is at most $3n^3 + 6n$.

4.2 Lower bound on the number of upward effective edges 

Consider some large enough n. This section shows that if $\mathrm{STTdist}(T, T') \le 3n^3 + 6n$, then
every long arm in T contains more than $2n^2$ upward effective edges.

We first show that the number of downward effective edges is at most 89n. We classify
the downward effective edges into the following three groups. For each downward effective
edge, there is a corresponding STT operation that moves a subtree S and its attached edge
e, called the carrying edge, from one end of some edge to the other end. The subtree S is
said to be carried by the edge e. Let $b_e$ be the bipartition induced by the carrying edge e.
We count the number of downward effective edges in each of the following groups: (a) e is
an external edge or $b_e \in B'$, (b) $b_e \in B$, and (c) e is not an external edge and $b_e \notin B \cup B'$.

Case (a). In this case, the subtree carried by e must have exactly 0, 1, $3n - 1$ or $3n$ leaves
with $s_i$ labels while the bipartition induced by the downward effective edge has 3 leaves
with $s_i$ labels on one side and $3n - 3$ leaves with $s_i$ labels on the other side. By checking
all cases, it is not possible to produce a bipartition in B' using the corresponding carrying
edge. The number of downward effective edges in this group is 0.

Case (b). In this case, since $b_e$ is in B, let $e_i$ be the edge in T that induces the same $b_e$.
Let the corresponding downward effective edge be $e_j$ and let $e_j$ map to $\hat{e}_j$. By checking all
possible cases, we know that both $b_{e_i}$ and $b_{e_j}$ must contain the same set of 3 leaves with $s_i$
labels in one side of the bipartition. From the structure of T, we know that $e_i$ and $e_j$ must
be in the same long arm A. Since $e_j$ is a downward effective edge, $e_i$ is closer to M (the
node where other long arms meet).

Consider A and denote its edges by $e_0, e_1, \ldots, e_{3n^2-1}$ ordered from M. The set of leaves
L between $e_i$ and $e_j$ in T contains leaves of labels $x_{i \bmod n^2}, x_{(i+1) \bmod n^2}, \ldots, x_{(j-1) \bmod n^2}$.
It can be easily verified that after the STT operation, one side of $b_{\hat{e}_j}$ will contain leaves of
these $x_i$ labels only. This partition has the following property: if it has m leaves of
label $x_p$, it must contain at least m leaves of each of the labels $x_{p+1}, \ldots, x_{n^2-1}$; otherwise, it cannot
induce a bipartition in T'. Therefore, we deduce that if $j \ne n^2, 2n^2$, then $i \equiv j \pmod{n^2}$.

When $i \equiv j \pmod{n^2}$, we know that L contains the same number of leaves with each label
$x_0, x_1, \ldots, x_{n^2-1}$. In T', the number of bipartitions in B' which have a side with an equal
number of leaves with labels $x_0, \ldots, x_{n^2-1}$ is at most 3n. Since each effective STT operation
corresponds to a unique edge in T', there are at most 3n downward effective
edges when $j \ne n^2, 2n^2$. If we include the case when $j = n^2, 2n^2$, then the total number
of downward effective edges in this group is at most 5n.

Case (c). In this case, the bipartition induced by the carrying edge e is not in B or B'. 
First, we have the following observation. 

Observation 4.3 Let e and e' be downward effective edges such that their corresponding
carrying edges induce the same bipartition of leaf labels. Then e and e' are in the same 
long arm in T. 

The next lemma shows that there are at most 7 downward effective edges in the same 
long arm in T such that the corresponding carrying edges induce the same bipartition of 
leaf labels. 

Lemma 4.4 For each carrying edge e, there are at most 7 downward effective edges induced 
by e. 

Proof. Fix a long arm in T and denote its edges by $e_0, e_1, \ldots, e_{3n^2}$ from M. Let $e_{i_1}, e_{i_2},
\ldots, e_{i_m}$ ($i_1 < i_2 < \cdots < i_m$) be m downward effective edges induced by e.

For $k = 1, 2, \ldots, m - 1$, let $L_k$ be the set of leaves between $e_{i_k}$ and $e_{i_{k+1}}$. Thus
$L_k = \{x_{i_k \bmod n^2}, \ldots, x_{(i_{k+1}-1) \bmod n^2}\}$. We claim that $L_k \cup L_{k+1}$ contains at least one $x_{n^2-1}$
leaf for all $k = 1, \ldots, m - 2$. The claim can be proved as follows.

To the contrary, assume $L_k \cup L_{k+1}$ does not contain $x_{n^2-1}$. Then, the sets $L_k$ and $L_{k+1}$
should be $\{x_a, \ldots, x_{b-1}\}$ and $\{x_b, \ldots, x_c\}$ respectively where $a = i_k \bmod n^2$, $b = i_{k+1} \bmod n^2$,
$c = (i_{k+2} - 1) \bmod n^2$, and $0 \le a < b \le c < n^2$.

Let $e_{i_k}$, $e_{i_{k+1}}$, $e_{i_{k+2}}$ map to $\hat{e}_{i_k}$, $\hat{e}_{i_{k+1}}$, and $\hat{e}_{i_{k+2}}$, respectively. Let R be the partition of
$b_{e_{i_k}}$ which contains more than 3 $s_i$ leaves. Let R' be the set of leaves in R excluding the
leaves in the subtree carried by e. Note that R' should have at most one $s_i$ leaf.
Otherwise, both partitions of $b_{\hat{e}_{i_k}}$ in T' contain more than one $s_i$ leaf, which is impossible.
Thus, R' contains all leaves below $\hat{e}_{i_k}$ of some arm A' in T'.

Using the same argument for $e_{i_{k+1}}$ and $e_{i_{k+2}}$, we can show that $R' \cup L_k$ and $R' \cup L_k \cup L_{k+1}$
contain all leaves below $\hat{e}_{i_{k+1}}$ and $\hat{e}_{i_{k+2}}$, respectively, of A' in T'. Recall that $x_a \in L_k$ and
$x_c \in L_{k+1}$. By construction of T', $c < a$. We arrive at a contradiction and the claim follows.

Since there are only 3 $x_{n^2-1}$ leaves in each arm of T, the above claim implies that there
are at most 7 downward effective edges induced by e. □

By Observation 4.3 and Lemma 4.4, each carrying edge corresponds to at most 7 downward
effective edges in T. Each such carrying edge e requires a distinct ineffective STT operation
to delete the corresponding bipartition $b_e \notin B \cup B'$. Therefore, the number of downward
effective edges in this group is at most $7 \cdot 12n = 84n$, since by Lemma 4.2, there are at
most 12n ineffective STT operations.

Summing up the number of possible downward effective edges, we have the following 
lemma. 

Lemma 4.5 The number of downward effective edges is at most 89n. 

Proof. Recall the classification of downward effective edges in this subsection. The numbers
of such edges in cases (a), (b) and (c) are at most 0, 5n and 84n, respectively. Thus, the
total number is at most 89n. □

Combining Lemmas 4.2 and 4.5, we have the following.

Lemma 4.6 If $\mathrm{STTdist}(T, T') \le 3n^3 + 6n$, then T has at least $3n^3 - 95n$ upward effective
edges.

Proof. The number of upward effective edges

= number of effective operations − number of downward effective edges

$\ge$ number of effective operations − 89n (by Lemma 4.5)

$\ge 3n^3 - 6n - 89n$ (by Lemma 4.2)

$= 3n^3 - 95n$.

Thus, Lemma 4.6 follows. □

Based on Lemma 4.6, we have the following corollary. 

Corollary 4.7 For large enough n ($n > 96$), if $\mathrm{STTdist}(T, T') \le 3n^3 + 6n$, then every
long arm in T contains more than $2n^2$ upward effective edges.

Proof. We prove the corollary by contradiction. Suppose one of the long arms contains at
most $2n^2$ upward effective edges. Then the number of upward effective edges in T is at most
$(3n^2 + 1)(n - 1) + 2n^2 = 3n^3 - n^2 + n - 1$. However, since $\mathrm{STTdist}(T, T') \le 3n^3 + 6n$,
by Lemma 4.6, T should have at least $3n^3 - 95n$ upward effective edges, which exceeds
$3n^3 - n^2 + n - 1$ when $n > 96$. We arrive at a
contradiction and the corollary follows. □
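The threshold on n in Corollary 4.7 comes from comparing the two counts in the proof. The sketch below, using the hypothetical helper `corollary_gap`, computes the difference between the lower bound of Lemma 4.6 and the upper bound obtained when one long arm has at most $2n^2$ upward effective edges. The difference simplifies to $n^2 - 96n + 1$, which is positive from n = 96 onwards, so the stated condition $n > 96$ is sufficient.

```python
def corollary_gap(n):
    """Lower bound (Lemma 4.6) minus upper bound (one deficient long arm)."""
    upper = (3 * n**2 + 1) * (n - 1) + 2 * n**2   # = 3n^3 - n^2 + n - 1
    lower = 3 * n**3 - 95 * n
    return lower - upper                           # a positive gap means a contradiction

# Smallest n for which the two bounds contradict each other.
smallest = next(n for n in range(1, 1000) if corollary_gap(n) > 0)
```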
4.3 Proof of the only-if part

Now, we prove the only-if part. 

Lemma 4.8 If an upward effective edge e in a long arm A in T maps to an edge $\hat{e}$ in an
arm A' in T', then the leaves below $\hat{e}$ in A' are originally the leaves below e in A.

Proof. Consider a particular leaf $l$ which is below $\hat{e}$ in A'. To prove by contradiction,
suppose $l$ is originally not a leaf below e in A, i.e., $l$ originally appears on the side of the
bipartition of e with more than $3n - 3$ $s_i$ leaves. Since e is an upward effective edge, this means that
after $\hat{e}$ is created, $l$ will remain on the side of the bipartition of $\hat{e}$ with more than $3n - 3$ $s_i$ leaves,
which is a contradiction. □

Lemma 4.9 Upward effective edges from distinct long arms in T cannot map to edges in the
same arm in T'.

Proof. Let $e_1$ and $e_2$ be upward effective edges from distinct long arms $A_1$ and $A_2$ in T.
Suppose to the contrary that the edges $\hat{e}_1$ and $\hat{e}_2$, which are mapped from $e_1$ and $e_2$, are in
the same arm, say A', in T'. Consider the unique leaf $x_{n^2-1}$ at the bottom of A' in T'. By
Lemma 4.8, this unique leaf should be below $e_1$ in $A_1$ in T. Similarly, we can show that
the unique leaf should be below $e_2$ in $A_2$ in T. The uniqueness of the leaf implies that $e_1$
and $e_2$ are in the same arm. Thus, a contradiction occurs and the lemma follows. □

Based on Corollary 4.7 and Lemmas 4.8 and 4.9, the short arms in T' can be divided 
into groups of 3. Each group corresponds to one long arm in T where the leaves of the same 
group are exactly those leaves of the corresponding long arm. Thus, we have the following 
theorem. 

Theorem 4.10 If $\mathrm{STTdist}(T, T') \le 3n^3 + 6n$, then S has an exact cover.

Proof. Suppose $\mathrm{STTdist}(T, T') \le 3n^3 + 6n$. By Corollary 4.7, every long arm in T contains
more than $2n^2$ upward effective edges. By Lemma 4.9, there exist at most $n - q$ long arms
in T whose upward effective edges can create edges in the $n - q$ long arms in T'.

Let R be the set of remaining (at least q) long arms in T. Since each long arm in R
contains more than $2n^2$ upward effective edges and each short arm in T' contains $n^2$ edges,
the upward effective edges on each long arm in R create edges in at least 3 short arms in
T'. As there are $3q$ short arms in T', we conclude that R contains exactly q long arms and
each short arm in T' has at least 1 edge created from some upward effective edge in a long
arm in R. By Lemma 4.8, every $s_i$ at the bottom of the short arms of T' should appear
in the long arms in R. It implies that the set of leaves with label $s_i$ at the bottom of the
q long arms must be equal to those at the bottom of the $3q$ short arms. In other words,
there is an exact cover of S. This concludes the correctness of the only-if part. □

5 Conclusions

In this paper, we have explored different metrics for comparing phylogenies. We have
devised an $O(n \log n)$-time algorithm for computing the non-shared edges, which in turn
reduces the time complexity of existing approximation algorithms for NNI distance from
$O(n^2)$ to $O(n \log n)$. On the other hand, we have extended the study of STT distance
to general degree-d trees. For the weighted case, we have shown that the STT distance of
two degree-d trees is equivalent to the restricted STT distance of two degree-3 trees. For
the unweighted case, we have given an algorithm that approximates STT distance within a
factor of $2d - 4$. Also, we have shown that computing STT distance between two
non-distinctly labeled trees is NP-hard.

For future work, we would like to know whether there exists a linear-time algorithm
for computing the non-shared edges, and whether computing STT distance between two
distinctly labeled trees is NP-hard.